Putting Visual Analytics into Practical Use
In this Take-home Exercise 3, I have explored economic condition (financial health) of the city of Engagement, Ohio USA by using appropriate static and interactive statistical graphic methods in R. The data is processed by using tidyverse family of packages, the statistical graphics are prepared by using tidyverse and the graphs are made interactive by using ggiraph. & plotly
Economic considers the financial health of the city. The financial health of a city is closely intertwined with that of its residents. Hence, the objective of this exercise is to visualize the answers for the following questions:
The dataset used in this exercise is FinancialJournal.csv file which contains information of the amount the residents have spent for each category such as education, food, shelter and recreation. Link to download the dataset is found below
Before we get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R pacakges. If they have yet to be installed, we will install the R packages and load them onto R environment.
The code chunk below is used to install and load the required packages onto RStudio.
packages = c('tidyverse','ggplot2','dplyr','plotrix','plyr','patchwork','ggthemes','hrbrthemes',
'trelliscopejs','lubridate','tidyr','CGPfunctions','ggExtra','ggridges','plotly','ggiraph','DT','gganimate','imputeTS','esquisse','GGally', 'parcoords','timetk','viridis','zoo')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The code chunk below imports FinancialJournal.csv from the
data folder into R by using read_csv()
of readr and save it as an
tibble data frame called data
# read csv file
data <- read_csv("data/FinancialJournal.csv")
# A tibble: 6 x 4
participantId timestamp category amount
<dbl> <dttm> <chr> <dbl>
1 0 2022-03-01 00:00:00 Wage 2473.
2 0 2022-03-01 00:00:00 Shelter -555.
3 0 2022-03-01 00:00:00 Education -38.0
4 1 2022-03-01 00:00:00 Wage 2047.
5 1 2022-03-01 00:00:00 Shelter -555.
6 1 2022-03-01 00:00:00 Education -38.0
This dataset contains the following information about financial transactions:
Negative values in the amount column represent expenses(resident spent) and positive values represent income (resident gained)
In this section, let’s try to understand how the amount spent by residents change for each category over the period March 2022 - May 2023.
Let’s first manipulate the data and derive new columns for visualization.
Date information is extracted from the timestamp attribute using as.Date().Similarly,
month and year information is extracted from date. The below code chunk
accomplishes this task.
data$date <- as.Date(data$timestamp)
data$month <- factor(month(data$date),
levels=1:12,
labels=month.abb,
ordered=TRUE)
data$year <- year(ymd(data$date))
data$Month_Yr <- format(as.Date(data$date), "%Y-%m")
min(data$date)
[1] "2022-03-01"
max(data$date)
[1] "2023-05-25"
This shows that the dataset has information about the residents from March 2022 to May 2023 (15 months)
As we know that Eduation, Food, Shelter, Recreation comes under expense category and Wage, Rent Adjustment are the income categories, for simplicity, let’s discard the sign and take the absolute value of the amount which will be easier for our comparison analysis. Also, the amount values are rounded off.
The below code performs the task and some of the functions used in
the code chunk are
(abs())[https://www.rdocumentation.org/packages/SparkR/versions/2.1.2/topics/abs]
- Computes the absolute value
(round())[https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Round]
- round off a no. to mentioned decimal digits.
data$amount <- abs(data$amount)
data$amount <- round(data$amount,digits=0)
data <- data[,c(1,2,5,6,7,8,3,4)] # rearranging the columns
head(data)
# A tibble: 6 x 8
participantId timestamp date month year Month_Yr
<dbl> <dttm> <date> <ord> <dbl> <chr>
1 0 2022-03-01 00:00:00 2022-03-01 Mar 2022 2022-03
2 0 2022-03-01 00:00:00 2022-03-01 Mar 2022 2022-03
3 0 2022-03-01 00:00:00 2022-03-01 Mar 2022 2022-03
4 1 2022-03-01 00:00:00 2022-03-01 Mar 2022 2022-03
5 1 2022-03-01 00:00:00 2022-03-01 Mar 2022 2022-03
6 1 2022-03-01 00:00:00 2022-03-01 Mar 2022 2022-03
# ... with 2 more variables: category <chr>, amount <dbl>
Let’s compute the total amount spent/ received by each participant in
each category at each timeframe of the year. Then, the long format data
is converted to wide format using pivot_wider() NA values in
the dataframe are replaced by 0 for accurate calculation. Also, new
attribute called cost_of_living is calculated by adding the
values of amount spent in Food, Education, Shelter,
Recreation.
data_by_year <- data %>%
group_by(participantId, category,Month_Yr) %>%
dplyr :: summarise(Total = sum(amount))
wide_fmt_year <- pivot_wider(data_by_year,names_from = category,values_from =Total)
wide_fmt_year <- na_replace(wide_fmt_year,0)
wide_fmt_year$cost_of_living <- wide_fmt_year$Education + wide_fmt_year$Food + wide_fmt_year$Recreation +wide_fmt_year$Shelter +wide_fmt_year$RentAdjustment
DT::datatable(wide_fmt_year, class= "compact")
Data objects in R can be rendered as HTML tables using the JavaScript library ‘DataTables’. This table is interactive, where you can filter he columns and search the desired values by typing in in the search box at the right.
To get a overall picture of wage over the timeline, wage category from the dataset is filtered and the below code accomplishes the task
wage_data <- data %>%
filter(category=="Wage")
ridge_plt <- ggplot(wage_data, aes(x = amount, y = Month_Yr, fill = ..x..)) +
geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
scale_fill_viridis_c(name = "Amount", direction = -1) +
xlim(0,1000) +
theme(axis.title.y=element_text(angle=0),
axis.line = element_line(color='grey'), plot.title = element_text(hjust = 0.5),
axis.title.y.left = element_text(vjust = 0.5,), axis.text = element_text(face="bold")) +
labs(x= "Amount",y= "Time Period",title="How wage is distributed over the period ?")
ridge_plt
Let’s look into some details of timeline chart to understand much about the variations or patterns.The below code chunk helps us to visualise the wage distribution of residents over the entire timeframe. Then excluding the first month as it is difficult to understand the patterns of other months due to its high range. Finally, the wage distribution of March 2022 to see what are the dates when residents are paid high.
options(scipen = 999)
tp1 <- ggplot(date_wage_df, aes(x=date, y=tot_amt)) +
geom_line()+
theme(axis.title.y=element_text(angle=0), axis.ticks.x=element_blank(),
axis.line = element_line(color='grey'), plot.title = element_text(hjust = 0.5),
axis.title.y.left = element_text(vjust = 0.5), axis.text = element_text(size=8,face="bold"))+
labs(x= "Timeframe",y= "Wage",title="Wage distribution over the entire period ")
filt_date_wage_df <- date_wage_df[-1,]
tp2 <- ggplot(filt_date_wage_df, aes(x=date, y=tot_amt)) +
geom_line()+
theme(axis.title.y=element_text(angle=0), axis.ticks.x=element_blank(),
axis.line = element_line(color='grey'), plot.title = element_text(hjust = 0.5),
axis.title.y.left = element_text(vjust = 0.5), axis.text = element_text(size=8,face="bold"))+
labs(x= "Timeframe",y= "Wage",title="Wage distribution excluding March 2022 ")
filt2_date_wage_df <- date_wage_df %>%
filter(date >= "2022-03-01" & date <= "2022-03-31")
tp3 <- ggplot(filt2_date_wage_df, aes(x=date, y=tot_amt)) +
geom_line()+
theme(axis.title.y=element_text(angle=0), axis.ticks.x=element_blank(),
axis.line = element_line(color='grey'), plot.title = element_text(hjust = 0.5),
axis.title.y.left = element_text(vjust = 0.5), axis.text = element_text(size=8,face="bold"))+
labs(x= "Timeframe",y= "Wage",title="Wage distribution of March 2022 ")
tp1/tp2/tp3
This chart reveals that in the month of March 2022, residents are getting paid high especially in the first 7 days March 1-7
Let’s zoom in and look at each participant’s financial health by comparing their income and cost of living at each time frame of the year. Trelliscope makes small multiple displays come alive by providing the ability to interactively sort and filter the plots based on summary statistics computed for each group.
wide_fmt_year %>%
ggplot(aes(Month_Yr, Wage) ) +
geom_point(aes(size=cost_of_living), show.legend = TRUE)+
theme_bw() +
labs(y= 'Income', x='Time Period', fill = "Expense") +
theme(axis.title.y=element_text(angle=0), axis.ticks.x=element_blank(),axis.text.x = element_text(angle=45, hjust=1),
axis.line = element_line(color='grey'), plot.title = element_text(hjust = 0.5),
axis.title.y.left = element_text(vjust = 0.5), axis.text = element_text(face="bold")) +
facet_trelliscope(~ participantId, nrow = 1, ncol = 3, width = 600, path="trellis/", self_contained = TRUE)